Data Transformation using Unity Catalog
This topic describes how to create a data transformation pipeline that uses a Unity Catalog-enabled Databricks cluster for data transformation and Databricks Unity Catalog as the data lake.
Prerequisites
Ensure that you complete the following prerequisites before you create a data transformation job:
- Access to a Unity Catalog-enabled Databricks node, which is used as the data transformation node in the pipeline. The Databricks Runtime version of the cluster must be 14.3 LTS or later, and the access mode must be dedicated or standard.
- Access to a Databricks Unity Catalog node, which is used as the data lake in the pipeline.
Creating a data transformation job
- On the home page of Data Pipeline Studio, add the following stages and connect them:
  - Data Transformation (Databricks - Unity Catalog enabled)
  - Data Lake (Databricks Unity Catalog)
- Configure the data lake node:
  - From the Use an existing Databricks Unity Catalog dropdown list, select an instance and click Add to data pipeline.
  - From the Schema Name dropdown list, select a schema.
  - (Optional) Click Data Browsing to browse the folders and view the required files.
  - Click Save.
- Click the data transformation node and do the following:
  - Select one of the following options that you want to use for the data transformation job:
    - Spark Cluster
    - SQL Warehouse
  - Click Create Templatized Job.
Complete the following steps to create the job:
Job Name
Provide job details for the data transformation job:
- Template - Based on the source and destination that you choose in the data pipeline, the template is automatically selected.
- Job Name - Provide a name for the data transformation job.
- Node Rerun Attempts - Specify the number of times a rerun is attempted on this node of the pipeline in case of failure. The default is set at the pipeline level. You can change the rerun attempts by selecting 1, 2, or 3.
Source
- Source - This is automatically selected depending on the data source node configured and added in the pipeline.
- Datastore - The configured datastore that you added to the data pipeline is displayed.
- Catalog Name - This is automatically selected depending on the configured datastore that you added to the pipeline.
- Schema Name - The schema name is displayed based on the configuration. You can click the dropdown list to select a different schema.
- Source Table - Select the table from the source that you want to use for transformation.
- Data Processing Type - Select the type of processing that must be done on the data. Choose from the following options:
  - Delta - In this type of processing, incremental data is processed. For the first job run, the complete data is considered. For subsequent job runs, only the delta data is considered (see the sketch after this list). If you select this option, you must provide the following information:
    - Table Versions (Change Feed Data) - Unity Catalog stores data in Delta format and creates versions of tables when the data changes (Databricks calls this change data feed). Select this option if you are using the Delta format, and then select a column from the dropdown list.
    - Table Columns - Unity Catalog also supports tables based on file formats such as CSV, JSON, and Parquet. Select this option if you are not using the Delta format, and then select a column based on which the delta data can be identified.
  - Full - In this type of processing, the complete data is considered for processing in each job run.
- Click Add to add the tables and the data processing type for each table.
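The platform handles delta detection for you, but the following PySpark sketch illustrates the difference between full processing and the two delta options. The table name `sales.orders`, the version number, and the watermark column and value are placeholder assumptions, not values generated by the platform.

```python
# Minimal sketch of full vs. delta processing (all table and column names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Full processing: every job run reads the complete table.
full_df = spark.read.table("sales.orders")

# Delta processing via table versions (change data feed): read only the rows that
# changed since the last processed version. Change data feed must be enabled on the table.
cdf_df = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)   # placeholder: last processed table version
    .table("sales.orders")
)

# Delta processing via a table column: filter on a column that identifies newly arrived rows.
incremental_df = spark.read.table("sales.orders").where("updated_at > '2024-01-01 00:00:00'")
```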
Transformation
In this step, you define the type of operations that you want to perform on the selected tables and provide the required details.
For each type of operation, provide the following details.

Filter
Provide the options for a where clause:
- Table - Select a table on which you want to run the query.
Where
- Column - Select a column from the table on which to run the where clause.
- Operator - Select an appropriate operator to add to the where clause.
- Value - Provide a value to add to the where clause.
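These selections translate into a simple where clause. The following sketch approximates what a generated filter query might look like; the table, column, operator, and value are placeholders, not output from the platform.

```python
# Rough equivalent of a generated filter query (names and values are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

filtered_df = spark.sql("""
    SELECT *
    FROM sales.orders
    WHERE order_status = 'SHIPPED'
""")
filtered_df.show()
```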
Join
Provide the options for a join operation:
- Table - Select the first table for the join operation.
- Alias Name - Provide an alias name for the table; otherwise, the default name is used.
Define Joins
- Join Type - Select a join type.
- Table 2 - Select the second table on which the join operation is to be performed.
- Alias Name - Provide an alias name for the table; otherwise, the default name is used.
- On - Select a column based on which the join operation is performed.
- Join Operator - The equal to operator (=) is used.
- Field - Select a column from the second table on which the join operation is to be performed.
+ Add Filter - Click this option if you want to filter the records based on a condition after the join operation is performed on the selected tables.
- Column - Select the column on which you want to apply the filter.
- Operator - Select an operator from the available options.
- Value - Enter the value for the condition.
Where
+ New Clause - Add another where clause on the data after the join operation and any additional filters that you applied.
- Column - Select the column on which you want to apply the filter.
- Operator - Select an operator from the available options.
- Value - Enter the value for the condition.
Click Add.
- Query - The SQL query is generated based on the join operation and the applied filters.
- Join Query - Click the down arrow to view the generated query.
- Query Columns and Data Preview - You can select a SQL warehouse to run the generated query, fetch the query columns, and view the data preview.
  - SQL Warehouse - Select a SQL warehouse from the dropdown list to run the generated SQL query.
  - Data Preview - View the output of the SQL query.
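The generated join query combines the join definition, any added filter, and the final where clause. The following sketch approximates such a query; the tables, aliases, columns, and values are placeholders.

```python
# Rough equivalent of a generated join query with an added filter and a where clause
# (tables, aliases, columns, and values are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

join_df = spark.sql("""
    SELECT o.order_id, o.amount, c.customer_name
    FROM sales.orders AS o
    INNER JOIN sales.customers AS c
        ON o.customer_id = c.customer_id      -- On / Join Operator / Field
    WHERE o.amount > 100                      -- + Add Filter
      AND c.country = 'US'                    -- Where / + New Clause
""")
join_df.show()
```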
Union
In this operation type, two tables are combined.
- Table 1 - Select the first table for the union operation.
- Table 2 - Select the second table for the union operation.
- Query - The query for the union operation is generated.
Click Add.
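The generated union query combines the rows of the two selected tables. A minimal sketch, assuming two placeholder tables with identical columns:

```python
# Rough equivalent of a generated union query (table names are placeholders;
# both tables are assumed to have the same columns).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

union_df = spark.sql("""
    SELECT * FROM sales.orders_2023
    UNION
    SELECT * FROM sales.orders_2024
""")
union_df.show()
```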
Aggregation
In this type of operation, a function is run on a set of values to create a single value.
- Table - Select a table on which to perform the aggregation.
Create Aggregation
- Function - Select a function from the dropdown list.
- Column - Select a column from the dropdown list on which to run the aggregate function.
- Group By - Select the column by which the results are grouped.
- Query - The generated query is displayed.
- Aggregation Query - Click the down arrow to view the generated aggregation query.
- Query Columns and Data Preview - You can select a SQL warehouse to run the generated query, fetch the query columns, and view the data preview.
  - SQL Warehouse - Select a SQL warehouse from the dropdown list to run the generated SQL query.
  - Data Preview - View the output of the SQL query.
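The aggregation query applies the selected function to the chosen column, grouped by the Group By column. A sketch with placeholder table, function, and column names:

```python
# Rough equivalent of a generated aggregation query (names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

agg_df = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount   -- Function / Column
    FROM sales.orders
    GROUP BY customer_id                               -- Group By
""")
agg_df.show()
```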
Target
- Target - The datastore that is used as the target in the data transformation job.
- Datastore - The datastore that you configured for the Databricks Unity Catalog is auto-populated.
- Catalog Name - The catalog name that is selected for the Unity Catalog instance is auto-populated.
- Schema - The schema that is selected for the Unity Catalog instance is auto-populated. If required, you can change the schema at this stage. Changing the schema may be required when you are using the same Unity Catalog (data lake) instance for multiple data sources.
- Map source data to target tables - Map the source data to a table in the target. You can either map an existing table or create a new table and map it.
  - Source - Select an option from the dropdown list.
  - Target Table - Select a table to store the results of the data transformation job. Alternatively, create a new table.
- Click Add Mapped Tables. To delete a mapping, click the ellipsis (...) and then click Delete.
Click Next.
Schema Mapping
In this step, you define the schema of the target table by using the options provided in this stage:
- Mapped Data - This provides the mapping of the source data with the target table that was done in the previous stage.
- Schema Mapping for Target Tables
  - Filter columns from selected tables - You can deselect columns as per your use case and apply a constraint to each column:
    - Deselect columns that are not required from the list of columns that is populated, and provide custom names to certain columns.
    - Select a constraint from the options: Set Not Null and Check. For the Check constraint, you must provide a SQL condition that is checked for the column. A sketch illustrating constraints and custom columns follows this procedure.
- Continue job run even if constraints are not met - When enabled, this toggle ensures that the job run continues even if a constraint is not met.
- Add Custom Columns - Enable this option to add columns in addition to the existing columns of the table. To add custom columns, do the following:
  1. Column Name - Provide a name for the custom column that you want to add.
  2. Type and Value - Select the parameter type for the new column. Choose from the following options:
     - Static Parameter - Provide a static value that is added to this column.
     - System Parameter - Select a system-generated parameter from the dropdown list that must be added to the custom column.
     - Generated - Provide the SQL code that combines two or more columns to generate the value of the new column.
  3. Click Add Custom Column after providing the details for each column.
  Repeat steps 1-3 for the number of columns that you want to add. After adding the required custom columns, click Add Column Mapping Details.
- To review the column mapping details, click the ellipsis (...) and click View Details.
- Click Add Schema Mapping.
- Click Next.
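The constraints and custom columns defined above correspond to standard Delta table DDL and column expressions. The following hedged sketch shows roughly what equivalent definitions look like; the tables, columns, constraint name, and CHECK condition are placeholders, not values produced by the platform.

```python
# Illustrative sketch of a NOT NULL constraint, a CHECK constraint, and custom columns
# (table, column, and constraint names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Constraints on the target table.
spark.sql("ALTER TABLE lake.orders_t ALTER COLUMN order_id SET NOT NULL")          # Set Not Null
spark.sql("ALTER TABLE lake.orders_t ADD CONSTRAINT valid_amount CHECK (amount >= 0)")  # Check

# Custom columns: a static value, a system parameter, and a generated expression.
df = spark.read.table("sales.orders")
df = (
    df.withColumn("source_system", F.lit("sales"))                                  # Static Parameter
      .withColumn("load_ts", F.current_timestamp())                                 # System Parameter
      .withColumn("full_name", F.expr("concat(first_name, ' ', last_name)"))        # Generated
)
```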
Data Management
In this step, you select the operation type that you want to perform on the source data and the partitioning that you want to create on the target table.
- Mapped Data - Select the mapped source and target tables.
- Operation Type - Select the operation type to perform on the source data. Choose one of the following options (a sketch of the merge and partitioning behavior follows this list):
  - Append - Adds new data to the target table without erasing the existing content.
  - Merge - For the first job run, the data is added to the target table. For subsequent job runs, this operation type merges the incoming data with the existing data when there is a change in the source data.
    Unique Column Name - The merge operation is performed based on the unique key column that you provide. Select a column name from the dropdown list.
  - Overwrite - Replaces the entire content of the target table with the new data.
- Enable Partitioning - Enable this option if you want to partition the target data. Select from the following options:
  - Data Partition - Select a column from the dropdown list and click Add.
  - Date Based Partition - Select the partition type: yearly, monthly, or daily. Optionally, provide a prefix to add to the partition.
  Click Add. The Data mapping for mapped tables section displays the mapping details. Click the ellipsis (...) to edit or delete the mapping.
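The merge and partitioning behavior corresponds to standard Delta Lake operations. The following hedged sketch, with placeholder table names, key column, and partition column, approximates a merge on a unique key column and a write partitioned by a date-based column:

```python
# Illustrative sketch of the merge operation and partitioning (names are placeholders).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
updates_df = spark.read.table("sales.orders_staging")

# Merge: upsert incoming rows into the target based on the unique key column.
target = DeltaTable.forName(spark, "lake.orders_t")
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.order_id = s.order_id")   # Unique Column Name
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Partitioning: write the target table partitioned by a date-based column.
(
    updates_df.write
    .partitionBy("order_date")
    .mode("overwrite")
    .saveAsTable("lake.orders_by_date")
)
```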
Cluster Configuration
You can select an all-purpose cluster or a job cluster to run the configured job. An all-purpose cluster can be used by multiple users to run ETL jobs, data analytics jobs, and so on, and it can be manually terminated and restarted. A job cluster is specific to the job and is terminated once the job is complete. If your Databricks cluster is not created through the Calibo Accelerate platform and you want to update custom environment variables, refer to the following:
All Purpose Clusters
Cluster - Select the all-purpose cluster that you want to use for the data transformation job from the dropdown list.
Job Cluster
Provide the following details for the job cluster:
- Choose Cluster - Provide a name for the job cluster that you want to create.
- Job Configuration Name - Provide a name for the job cluster configuration.
- Databricks Runtime Version - Select the appropriate Databricks Runtime version.
- Worker Type - Select the worker type for the job cluster.
- Workers - Enter the number of workers to be used for running the job in the job cluster. You can either have a fixed number of workers or choose autoscaling.
- Enable Autoscaling - Autoscaling scales the number of workers up or down within the range that you specify. This helps in reallocating workers to a job during its compute-intensive phase. Once the compute requirement reduces, the excess workers are removed, which helps control your resource costs.

Cloud Infrastructure Details
- First on Demand - Provide the number of cluster nodes that are marked as first_on_demand. The first_on_demand nodes of the cluster are placed on on-demand instances.
- Availability - Choose the type of EC2 instances on which to launch your Apache Spark clusters, from the following options:
  - Spot
  - On-demand
  - Spot with fallback
- Zone - The identifier of the availability zone or data center in which the cluster resides. The provided availability zone must be in the same region as the Databricks deployment.
- Instance Profile ARN - Provide an instance profile ARN that can access the target Amazon S3 bucket.
- EBS Volume Type - The type of EBS volume that is launched with this cluster.
- EBS Volume Count - The number of volumes launched for each instance of the cluster.
- EBS Volume Size - The size of the EBS volume to be used for the cluster.

Additional Details
- Spark Config - To fine-tune Spark jobs, provide custom Spark configuration properties in key-value pairs.
- Environment Variables - Configure custom environment variables that you can use in init scripts.
- Logging Path (DBFS Only) - Provide the logging path to deliver the logs for the Spark jobs.
- Init Scripts - Provide the init (initialization) scripts that run during the startup of each cluster.
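The job cluster settings above map onto the cluster specification that Databricks jobs accept. The following hedged sketch shows the approximate shape of such a specification; all values, the instance profile ARN, and the DBFS paths are placeholders, and the platform builds the actual specification for you.

```python
# Approximate shape of a Databricks job cluster specification built from the
# settings above (all values are placeholders).
new_cluster = {
    "spark_version": "14.3.x-scala2.12",                   # Databricks Runtime Version
    "node_type_id": "i3.xlarge",                           # Worker Type
    "autoscale": {"min_workers": 2, "max_workers": 8},     # Workers / Enable Autoscaling
    "aws_attributes": {                                    # Cloud Infrastructure Details
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "us-east-1a",
        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<name>",
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,
    },
    "spark_conf": {"spark.sql.shuffle.partitions": "200"},                 # Spark Config
    "spark_env_vars": {"MY_ENV_VAR": "value"},                             # Environment Variables
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},   # Logging Path (DBFS Only)
    "init_scripts": [{"dbfs": {"destination": "dbfs:/init/setup.sh"}}],    # Init Scripts
}
```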
Notifications
- SQS and SNS Configurations - Select an SQS or SNS configuration that is integrated with the Calibo Accelerate platform.
- Events - Select the events for which notifications are enabled:
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - Select the details of the events for which notifications are enabled.
- Additional Parameters - Provide any additional parameters to be considered for the SQS and SNS queues.
What's next? Snowflake Custom Transformation Job